Jumpman NYC Expansion

       

Goal 1: Data Integrity Issues

   

PLEASE SEE ALL TABS :)

   

   

Summary: All Found Issues

  1. Item quantity, name, and category are null

  2. Jumpman arrived-at and left pick-up times are null

  3. Jumpman arrival time at the pick-up location is before the delivery start time

  4. Place Category is null

  5. How Long It Took To Order is null

   

Issue 1: Item Quantity, Name, and Category is Null

Description of the issue: Some delivery rows have an “N/A” value for the item quantity, category, and name.

Percent of affected records: 20.5%

Why is this an issue: Sending a jumpman to a pickup location without any items to deliver undermines the value of the application.
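As a minimal sketch of how a figure like the percentage above can be computed, the snippet below measures the share of rows with a missing item quantity. The toy data frame and the column name `item_quantity` are assumptions for illustration, and it assumes the “N/A” values were read in as `NA`.

```r
# Toy data, not the Jumpman data; `item_quantity` is an assumed column name.
deliveries <- data.frame(
  delivery_id   = 1:5,
  item_quantity = c(1, NA, 2, NA, 1)
)

# Share of rows with a missing item quantity, expressed as a percentage.
pct_missing <- 100 * mean(is.na(deliveries$item_quantity))
pct_missing
```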

Example:

Item quantity N/A example image


Root of the issue:

Hypothesis for what might be causing this issue

  1. Cancellations: The mean delivery completion time and distance for rows with an empty item quantity field are significantly lower than for rows with a filled item quantity field. This might imply that the user rethought their purchase and obtained the item themselves.
ggplot() + geom_density(data = numbered_quantity, aes(x=time_to_deliver,  fill= "Number"), adjust = 1, alpha=.3) + geom_density(data = empty_quantity, aes(x=time_to_deliver, fill="N/A"), adjust = 1, alpha=.3) + jhilliard_theme + scale_fill_manual(name="Item Quantity", values=c(Number="#56B4E9", "N/A"="#E69F00")) + labs(x="Minutes", y="Density", title="Time to Complete Delivery")+ xlim(0, 125)

ggplot() + geom_density(data = numbered_quantity, aes(x=distance_haversine_in_miles,  fill= "Number"), adjust = 1, alpha=.3) + geom_density(data = empty_quantity, aes(x=distance_haversine_in_miles, fill="N/A"), adjust = 1, alpha=.3) + jhilliard_theme + scale_fill_manual(name="Item Quantity", values=c(Number="#56B4E9", "N/A"="#E69F00")) + labs(x="Miles", y="Density", title="Delivery Distance")

one.sample.z(empty_quantity$time_to_deliver, mean(numbered_quantity$time_to_deliver), sigma = sd(numbered_quantity$time_to_deliver))
## 
## One sample z-test 
##              z*           P-value
##  -10.17845 mins 2.474603e-24 mins
one.sample.z(empty_quantity$distance_haversine_in_miles, mean(numbered_quantity$distance_haversine_in_miles), sigma = sd(numbered_quantity$distance_haversine_in_miles))
## 
## One sample z-test 
##        z*      P-value
##  -9.84389 7.283858e-23
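For readers unfamiliar with the test above, here is a base-R sketch of what a one-sample z-test computes: the sample mean is compared against a reference mean, scaled by a known standard deviation. The helper `z_test` is hypothetical and is not the `one.sample.z` function called above; the input values are toy numbers.

```r
# Hypothetical helper illustrating a two-sided one-sample z-test.
z_test <- function(x, mu0, sigma) {
  x <- as.numeric(x[!is.na(x)])                     # drop NAs, strip units
  z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))  # standardized difference
  p <- 2 * pnorm(-abs(z))                           # two-sided p-value
  c(z = z, p_value = p)
}

# Toy values, not the Jumpman data.
reference <- c(40, 45, 50, 55, 60)
sample_x  <- c(30, 35, 32, 38)
z_test(sample_x, mu0 = mean(reference), sigma = sd(reference))
```

A large negative z, as in the output above, indicates that the sample mean sits well below the reference mean relative to the sampling variability.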

What is not causing the issue

  1. The issue does not stem from particular customers, jumpmen, pickup places, vehicle types, or place categories, as none of the values are unique to rows with empty item quantities.

  2. There doesn’t seem to be an issue with pickup and dropoff locations, as the locations for rows with and without an item quantity are both spread throughout the city.

ny <- get_map(location = c(lon = -73.972026, lat = 40.745362), zoom = 12)
pickup_num  <- ggmap(ny, extent = "device", legend = "topleft") + geom_point( aes(x = pickup_lon, y = pickup_lat), size = .3, data = numbered_quantity) + labs(title= "Normal Pickup Locations")
pickup_empty <- ggmap(ny, extent = "device", legend = "topleft") + geom_point( aes(x = pickup_lon, y = pickup_lat), size = .3, data = empty_quantity)  + labs(title= "N/A Item Quantity Pickup Locations")
grid.arrange(pickup_num, pickup_empty, nrow=1, ncol=2)

# Both panels show dropoff locations, so the objects are named accordingly
dropoff_num <- ggmap(ny, extent = "device", legend = "topleft") + geom_point( aes(x = dropoff_lon, y = dropoff_lat), size = .3, data = numbered_quantity) + labs(title= "Normal Dropoff Locations")
dropoff_empty <- ggmap(ny, extent = "device", legend = "topleft") + geom_point( aes(x = dropoff_lon, y = dropoff_lat), size = .3, data = empty_quantity)  + labs(title= "N/A Item Quantity Dropoff Locations")
grid.arrange(dropoff_num, dropoff_empty, nrow=1, ncol=2)

   

Issue 2: Arrived and left pick-up time is null

Description of the issue: Some delivery rows have an “N/A” value for the time the jumpman arrived at (and left) the pickup location.

Percent of affected records: 9.2%

Why is this an issue: Not knowing the amount of time spent at a pickup location makes wait times at pickup locations unpredictable. It also becomes difficult to calculate which pickup locations are the most profitable.

Example:

Arrived and left pick-up time example image


Root of the issue:

Hypothesis for what might be causing this issue

  1. Jumpman forgets to notify the app upon arrival at the pickup location: It is possible that jumpmen forget to update the application when they arrive at the pickup location. The app could notify jumpmen when their phone’s GPS comes within a reasonable radius of the pickup location. We could also send them a reminder to check in after the median time (10 minutes) it takes a jumpman to arrive at a pickup location. It is worth noting that there is a significant difference in the time and distance of these specific orders: deliveries with a recorded pickup time are completed faster, even though they cover a greater distance.
median(pickup_time$distance_haversine_in_miles)
## [1] 0.8664454
median(no_pickup_time$distance_haversine_in_miles)
## [1] 0.610389
median(pickup_time$time_to_deliver)
## Time difference of 42.45873 mins
median(no_pickup_time$time_to_deliver)
## Time difference of 46.2639 mins
one.sample.z(no_pickup_time$time_to_deliver, mean(pickup_time$time_to_deliver), sigma = sd(pickup_time$time_to_deliver))
## 
## One sample z-test 
##             z*           P-value
##  4.190639 mins 2.781696e-05 mins
one.sample.z(no_pickup_time$distance_haversine_in_miles, mean(pickup_time$distance_haversine_in_miles), sigma = sd(pickup_time$distance_haversine_in_miles))
## 
## One sample z-test 
##         z*      P-value
##  -4.206264 2.596265e-05
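The GPS proximity reminder suggested in the hypothesis above could be sketched as a simple geofence check based on the haversine distance. The function names, the 0.05-mile threshold, and the coordinates below are all hypothetical; this is an illustration under those assumptions, not how the app works.

```r
# Great-circle distance in miles between two points (haversine formula).
haversine_miles <- function(lat1, lon1, lat2, lon2, radius_miles = 3958.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_miles * asin(pmin(1, sqrt(a)))
}

# Hypothetical check: is the jumpman within `threshold_miles` of the pickup?
near_pickup <- function(jm_lat, jm_lon, pk_lat, pk_lon, threshold_miles = 0.05) {
  haversine_miles(jm_lat, jm_lon, pk_lat, pk_lon) <= threshold_miles
}

near_pickup(40.7454, -73.9720, 40.7455, -73.9721)  # a few metres apart
near_pickup(40.7454, -73.9720, 40.7600, -73.9720)  # roughly a mile apart
```

When the check first returns TRUE for an open delivery with no recorded arrival, the app could prompt the jumpman to check in.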

What is not causing the issue

  1. Location of Pickup Place: Originally, I suspected a lack of cell service as the reason the jumpmen did not check in, but there is no noticeable difference between the two cohorts’ pickup place locations.
# Compare the density of pickup locations for deliveries with and without a pickup check-in time
ny <- get_map(location = c(lon = -73.972026, lat = 40.745362), zoom = 13)
pickup_no_checkin <- ggmap(ny, extent = "device", legend = "topleft") + stat_density2d( aes(x = pickup_lon, y = pickup_lat, alpha = ..level..), size = 2, bins = 4, data = no_pickup_time, geom = "polygon") + labs(title= "Density of Pickup Location \n with No Pickup Checkin Time")
# Use pickup coordinates here as well, so both panels compare pickup locations as their titles state
pickup_checkin <- ggmap(ny, extent = "device", legend = "topleft") + stat_density2d( aes(x = pickup_lon, y = pickup_lat, alpha = ..level..), size = 2, bins = 4, data = pickup_time, geom = "polygon")  + labs(title= "Density of Pickup Location \n with Pickup Checkin Time")
grid.arrange(pickup_no_checkin, pickup_checkin, nrow=1, ncol=2)

   

Issue 3: Jumpmen Arrive at Pickup Location Time before Delivery Start Time

See: Other issues not broken down

Example:

Pickup time before order time example image


   

Other Issues Not Broken Down

As this is a pre-interview assignment, I am not going to go in depth on the other 3 issues that I found with the data.

  1. BIG ONE: JUMPMEN ARRIVE AT THE PICKUP LOCATION BEFORE THE DELIVERY IS STARTED

  2. Place Category is null

  3. How Long It Took To Order is null

   

Non Issues

  1. ID integrity: While there are duplicate delivery ids, this does not denote an ID integrity issue. It denotes that a customer ordered multiple items from the same location in a single order. It is also worth noting that each duplicated delivery id corresponds to a single customer id, jumpman id, and start time.
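A check like the one described above can be sketched as follows: if every delivery id maps to exactly one combination of customer, jumpman, and start time, the duplicates are just multiple item rows. The toy data and all column names are assumptions for illustration.

```r
# Toy data, not the Jumpman data; column names are assumed.
deliveries <- data.frame(
  delivery_id = c(1, 1, 2, 3, 3),
  customer_id = c(10, 10, 11, 12, 12),
  jumpman_id  = c(7, 7, 8, 9, 9),
  start_time  = c("18:00", "18:00", "19:30", "20:15", "20:15"),
  item_name   = c("Burger", "Fries", "Salad", "Pizza", "Soda")
)

# Distinct (delivery, customer, jumpman, start time) combinations.
keys <- unique(deliveries[, c("delivery_id", "customer_id",
                              "jumpman_id", "start_time")])

# A delivery_id that still repeats here has conflicting metadata.
conflicting_ids <- unique(keys$delivery_id[duplicated(keys$delivery_id)])
length(conflicting_ids)
```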

   

   

High Level Summary

While deliveries per day and unique customers ordering per day are growing, new customer acquisition is declining. This suggests that customers are being retained and are ordering more than once.

   

Goal 2: State of the Union (How are things going in NY?)

   

PLEASE SEE ALL TABS :)

   

   

So that the “dirty” rows do not affect the findings, for this analysis I have removed rows from the data set with pickup times that are before order times and rows with empty item quantity values.
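The cleaning step described above could look roughly like the sketch below. The toy data, the column names, and the choice to keep rows whose pickup arrival time is missing are all assumptions; the actual cleaning code is not shown in this report.

```r
# Toy data; column names are assumed and times are minutes since midnight.
raw <- data.frame(
  delivery_id       = 1:4,
  item_quantity     = c(1, NA, 2, 1),
  started_at        = c(600, 610, 620, 630),
  arrived_at_pickup = c(615, 625, 618, NA)
)

# Flag rows where the pickup arrival precedes the delivery start.
# Rows with a missing arrival time are not flagged (a judgment call).
pickup_before_start <- !is.na(raw$arrived_at_pickup) &
  raw$arrived_at_pickup < raw$started_at

# Keep rows that have an item quantity and are not flagged.
cleaned <- raw[!is.na(raw$item_quantity) & !pickup_before_start, ]
cleaned$delivery_id
```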

Delivery Breakdown

Delivery Count Growth

ggplot(order_by_day_no_dup_delivery_id, aes(x = as.Date(date_no_time), y = count)) + geom_point() + geom_smooth(colour="darkgoldenrod1", size=1.5, method="loess", se=FALSE) + geom_smooth(method="lm", se=FALSE) + labs(x="Date", y="Number of Deliveries", title="Number of Deliveries per Day")+ jhilliard_theme

Deliveries per day have grown modestly since the opening of the NYC market. There are peaks that correspond with Sundays.

Delivery Dropoff and Pickup Location

pickup <- ggmap(ny, extent = "device", legend = "topleft") + stat_density2d( aes(x = pickup_lon, y = pickup_lat, alpha = ..level..), size = 2, bins = 4, data = jumpman_data_cleaned, geom = "polygon") + labs(title= "Density of Pickup location")
dropoff <- ggmap(ny, extent = "device", legend = "topleft") + stat_density2d( aes(x = dropoff_lon, y = dropoff_lat, alpha = ..level..), size = 2, bins = 4, data = jumpman_data_cleaned, geom = "polygon")  + labs(title= "Density of Dropoff Locations")
grid.arrange(pickup, dropoff, nrow=1, ncol=2)

Pickup locations are more concentrated than dropoff locations. Customers tend to order from places in the East Village and Lower Manhattan, but orders get delivered to a much broader area. This makes sense, as the East Village has a large concentration of shops and restaurants. One interesting thing of note is the dropoff location density on the Upper East Side; this MAY imply that the customer market segment tends to have higher incomes.

Deliveries By Time and Day

ggplot(na.omit(jumpman_day_hour), aes(x = hour_ordered, y = day_ordered, fill=count)) +
    geom_tile() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.6, size = 10)) +
    labs(x = "Hour of Request", y = "Day of Week of Request", title = "# of Delivery Requests in NYC, by Day and Time of Request") +
    scale_fill_gradient(low = "white", high = "#2980B9")

Orders tend to be concentrated around dinner time on all days of the week. Order times vary more on weekends, but still seem to be concentrated around dinner. Weekends have more orders than weekdays.
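The heatmap above is drawn from a table of counts by day of week and hour. A minimal sketch of building such a table in base R follows; the toy timestamps and the column name `started_at` are assumptions, and the day is encoded numerically here to avoid locale-dependent day names.

```r
# Toy timestamps, not the Jumpman data; `started_at` is an assumed column name.
orders <- data.frame(
  started_at = as.POSIXct(
    c("2024-01-07 18:10:00", "2024-01-07 18:45:00",
      "2024-01-08 12:05:00", "2024-01-14 18:30:00"),
    tz = "UTC"
  )
)

lt <- as.POSIXlt(orders$started_at, tz = "UTC")
orders$day_ordered  <- lt$wday   # 0 = Sunday, ..., 6 = Saturday
orders$hour_ordered <- lt$hour   # 0 to 23
orders$count        <- 1

# One row per (day, hour) combination with the number of requests.
day_hour <- aggregate(count ~ day_ordered + hour_ordered,
                      data = orders, FUN = sum)
day_hour
```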

Distance from Pickup to Dropoff Location

# The mean/median lines are mapped to the colour aesthetic, so their palette is set with scale_colour_manual
ggplot(data = jumpman_data_cleaned) +
    geom_histogram(aes(x = distance_haversine_in_miles), fill = "springgreen4", binwidth = .15) +
    geom_vline(aes(xintercept = mean(distance_haversine_in_miles), colour = "mean_line")) +
    geom_vline(aes(xintercept = median(distance_haversine_in_miles), colour = "median_line")) +
    scale_colour_manual(name = "Statistic", values = c("mean_line" = "orange", "median_line" = "purple")) +
    labs(x = "Miles", y = "Delivery Count", title = "Distance from Pickup to Dropoff Location") +
    jhilliard_theme

The distribution of pickup-to-dropoff distance is unimodal. Distances tend to be under 2 miles, with a median of ~1 mile. Since some deliveries cover more than 4 miles, it would be worth investigating the delivery methods used for those deliveries.

Time to Complete Delivery

# The mean/median lines are mapped to the colour aesthetic, so their palette is set with scale_colour_manual
ggplot(data = jumpman_data_cleaned) +
    geom_histogram(aes(x = time_to_deliver), fill = "springgreen4", binwidth = 5) +
    geom_vline(aes(xintercept = mean(time_to_deliver_num), colour = "mean_line")) +
    geom_vline(aes(xintercept = median(time_to_deliver_num), colour = "median_line")) +
    scale_colour_manual(name = "Statistic", values = c("mean_line" = "darkblue", "median_line" = "purple")) +
    labs(x = "Minutes", y = "Delivery Count", title = "Time to Complete the Delivery") +
    jhilliard_theme

The distribution of delivery completion time is unimodal with a median of ~45 minutes. Some deliveries took over an hour; it would be worth investigating those to see why they took so long.

Time to Pickup Item for Delivery

# The mean/median lines are mapped to the colour aesthetic, so their palette is set with scale_colour_manual
ggplot(data = pickup_location) +
    geom_histogram(aes(x = time_to_pickup/60), binwidth = 5, fill = "springgreen4") +
    geom_vline(aes(xintercept = mean(time_to_pickup_num)/60, colour = "mean_line")) +
    geom_vline(aes(xintercept = median(time_to_pickup_num)/60, colour = "median_line")) +
    scale_colour_manual(name = "Statistic", values = c("mean_line" = "darkblue", "median_line" = "purple")) +
    labs(x = "Minutes", y = "Delivery Count", title = "Time to Reach Pickup Location") +
    jhilliard_theme

The distribution of pickup location arrival time is unimodal with a median of ~20 minutes. Negative values in this column have been removed for this graph.

   

Customer Breakdown

Unique Customers Ordering per Day

ggplot(customer_by_day, aes(x = as.Date(date_no_time), y = count)) + geom_point() + geom_smooth(colour="darkgoldenrod1", size=1.5, method="loess", se=FALSE) + geom_smooth(method="lm", se=FALSE) + labs(x="Date", y="Number of Customers", title="Number of Unique Customers who Ordered in a given Day") + jhilliard_theme

New Customers per Day

The number of unique customers ordering per day is growing, but at a modest rate. There are peaks that correspond with Sundays. BUT…

ggplot(customer_first_order_day, aes(x = as.Date(earliest_order_date), y = count)) + geom_point() + geom_smooth(colour="darkgoldenrod1", size=1.5, method="loess", se=FALSE) + geom_smooth(method="lm", se=FALSE) + labs(x="Date", y="Number of New Customers", title="First Time Customers Who Ordered per Day") +jhilliard_theme

It is important to note that new customer acquisition has gone down since the launch of the NYC market. This means that orders from new customers have gone down. We can address this with a customer acquisition campaign.
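New customers per day can be derived by finding each customer's earliest order date and counting customers per date. The sketch below shows one way to do this in base R; the toy data and column names are assumptions, and the report's own `customer_first_order_day` may have been built differently.

```r
# Toy data, not the Jumpman data; column names are assumed.
orders <- data.frame(
  customer_id = c(1, 2, 1, 3, 2, 4),
  order_date  = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02",
                          "2024-01-02", "2024-01-03", "2024-01-03"))
)

# Sort by date, then keep each customer's first row (their earliest order).
orders       <- orders[order(orders$order_date), ]
first_orders <- orders[!duplicated(orders$customer_id), ]

# Number of first-time customers per day.
new_per_day <- table(first_orders$order_date)
new_per_day
```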

Customer Order Retention

ggplot(data = daily_retention) + geom_line(aes(y=Prc, x=date_diff, group=date1), colour="grey") + geom_smooth(aes(y=Prc, x=date_diff), colour="darkgoldenrod1", size=1.5, se=FALSE) + labs(y="Percent Retained", x="Days Since Order", title="Likelihood of an Order N Days after a Separate Order \n from a Unique Customer") + jhilliard_theme

The expected likelihood of another order coming in n days after an order is ~3%. This means that if you acquire 100 ordering customers through an acquisition campaign, it would not be unreasonable to expect this set of customers to generate ~3 returning customers per day (for the foreseeable future) after they are initially acquired. While this doesn’t tell the whole customer retention story, it is a nice start.

I ignore the increasing trend at the end of the graph because it is likely caused by early-adopter bias.
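To make the retention estimate above more concrete, here is a simplified sketch that computes the share of customers who order again exactly n days after their first order. This is a swapped-in, simpler variant anchored on each customer's first order, not the computation behind `daily_retention`; the toy data and column names are assumptions.

```r
# Toy data, not the Jumpman data; column names are assumed.
orders <- data.frame(
  customer_id = c(1, 1, 2, 2, 3, 4),
  order_date  = as.Date(c("2024-01-01", "2024-01-03", "2024-01-01",
                          "2024-01-02", "2024-01-01", "2024-01-01"))
)

# Days elapsed between each order and that customer's first order.
first_date <- ave(as.numeric(orders$order_date), orders$customer_id, FUN = min)
orders$days_since_first <- as.numeric(orders$order_date) - first_date
n_customers <- length(unique(orders$customer_id))

# Share of customers with an order exactly n days after their first order.
retention <- sapply(1:3, function(n) {
  returning <- unique(orders$customer_id[orders$days_since_first == n])
  length(returning) / n_customers
})
names(retention) <- paste0("day_", 1:3)
retention

# The arithmetic in the text: 100 acquired customers at a ~3% daily rate.
100 * 0.03
```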

Fun Bonus Map: Top 10 Percentile of Pickup Locations

The top 10 percent of pickup locations, as determined by the number of items ordered.

map <- leaflet()
map <- addTiles(map)
order_place <- aggregate(cbind(count = item_quantity) ~ pickup_place + pickup_lat + pickup_lon, 
          data = jumpman_data_cleaned, 
          FUN = function(x){ NROW(x) })
order_place <- data.frame(order_place)
order_place <- order_place %>% rowwise() %>% mutate(popup = paste(pickup_place, toString(count), sep=" : ") )
order_place <- order_place[ which(order_place$count >= 21), ]
map <- addMarkers(map, lng=order_place$pickup_lon, lat=order_place$pickup_lat, popup=order_place$popup)
map

This map shows the top 10% of places from which people ordered an item. The user can click on a marker to get the name of the restaurant and the number of items ordered from that location.
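The code above filters on a fixed count of 21. As a hedged alternative, the cutoff for the top 10% could be derived from the data with `quantile()`, so it adapts if the data changes. The toy counts below are made up, and whether this reproduces the fixed cutoff on the real data has not been checked.

```r
# Toy counts per pickup place, not the Jumpman data.
place_counts <- data.frame(
  pickup_place = paste("Place", 1:10),
  count        = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
)

# 90th percentile of the counts; places at or above it form the top 10%.
threshold  <- quantile(place_counts$count, probs = 0.9)
top_places <- place_counts[place_counts$count >= threshold, ]
top_places
```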